Abstract. In this paper we present a combinatorial model of sequence to shape maps. Our
particular construction arises in the context of representing nucleotide interactions beyond
Watson-Crick base pairs, and its key feature is to replace steric by combinatorial constraints.
We show that these combinatory maps produce exponentially many shapes and induce sets of
sequences which contain extended connected subgraphs of diameter n, i.e. we show that
exponentially many shapes have neutral networks.



1. Introduction 



1.1. Background. Arguably one of the greatest challenges in present-day biophysics is the
understanding of sequence-structure relations of biopolymers. For one particular class of
biopolymers, the ribonucleic acid (RNA) secondary structures (Fig. 1), molecular folding maps
have been systematically analyzed by Schuster et al. [3, 27, 26]. Folding maps play a central
role in understanding the evolution of molecular sequences. Specific properties like, for
instance, shape space covering [28] and neutral networks (Fig. 7) [24] are critical for what may
be paraphrased as "molecular computation by white noise". For instance, neutral networks played
a central role in the Science publication by E. Schultes and D. Bartel, One sequence, two
ribozymes: implications for the emergence of new ribozyme folds (v289, n5478, 448-452), where
the authors experimentally designed a single RNA sequence (whose existence is implied by the
intersection theorem in [24]) that folds into two different, unrelated RNA secondary structures
[6]. Exhaustive enumeration of sequence spaces and subsequent detailed analysis of the mappings
for G,C-sequences of length 30 were undertaken in [11, 12]. In addition, detailed analysis of
neutral networks as well as exhaustive enumeration of G,C,A,U-sequences can be found in [9].
The findings were intriguing. Folding
FIGURE 1. RNA secondary structures. Diagram representation (top): the primary
sequence, GAGAGCCUUUGGACCUCA, is drawn horizontally and its backbone 
bonds are ignored. All bonds are drawn in the upper half plane and secondary structures 
have the property that no two arcs intersect and all arcs have minimum length 2. Outer 
planar graph representation (bottom). 



maps into RNA secondary structures exhibit a collection of distinct properties which makes them 
ideally suited for evolutionary optimization. 

(a) Many structures have preimages of sequences (neutral networks) which have large components 
and large diameter. 

(b) Many structures have the property that any two of them have neutral networks that come close 
in sequence space. 

Obviously, (a) is of central importance in the context of neutral evolution. Since replication is
erroneous and only a few, if not single, nucleotides can be exchanged, the preimages of structures
must contain large connected components. Property (b) shows that (many) new structures can easily
be found during a random walk on a neutral network using only steps in which a single nucleotide
is altered (point mutations).

Folding maps, however, are not obtained analytically. They are the result of a computer algorithm
based on the combinatorial analysis of RNA secondary structures pioneered by Waterman et al.
FIGURE 2. The neutral network of a structure. Sequence space (right) and shape space
(left) represented as lattices. We draw the edges between two sequences bold if they map
into the one particular structure on the left. The two key properties of neutral nets are
their connectivity and percolation. They allow sequences to move through sequence space
while maintaining a shape.



[2"5l 1501 15T] . It has to be remarked in this context that comparative sequence analysis [3H QI5] pro- 
vides more reliable means for determining the secondary structure of biological RNA [J], i.e. folding 
maps already represent an abstraction. In order to step beyond the secondary structure paradigm,
two main approaches with distinct goals are: (1) to study more advanced nucleotide interactions
in RNA, like for instance pseudoknots or base triples, or (2) to consider genuine abstractions of
molecular structures not aiming to model a biophysical folding map. In [15] we pursue the first
by developing the combinatorics of RNA structures with pseudoknots, and in this contribution the
second by studying combinatory maps. While (1) eventually produces the mathematical framework
enabling us to derive more advanced representations (which eventually result in folding
algorithms capable of producing structures like phenylalanine tRNA), (2) provides insights on
the core question of which principles produce sequence to structure maps suitable for evolution.
A type (2) abstraction inevitably evokes skepticism: what can possibly be gained if no attempt
is made to mimic the biological reality? However, we argue that sometimes it is exactly the
right strategy to fundamentally understand the object under investigation.
1.2. Structures and correlations. A well studied class of maps over sequence spaces are the
NK-landscapes introduced by Kauffman [17], where each index (locus) of a binary n-tuple, viewed
as the genotype composed of n loci, is randomly linked to K other indices. The idea is that a
locus i makes a contribution to the total fitness of the genotype which depends on the value of
the allele (0 or 1) at i and the values at each of the epistatically linked loci. To each of
those $2^{K+1}$ combinations a value (fitness) is assigned uniformly at random. The apparent lack
of neutrality led Barnett [2] to refine NK-landscapes to NKp-landscapes, introducing a
probability p with which an arbitrarily chosen allelic combination makes no contribution to the
fitness. Our approach is connected to Kauffman's intuition in that we consider a molecular
structure as a combinatorial representation of nucleotide correlations. As for nucleotide
correlations, observations (a) and (b) are not bound to the particular concept of RNA secondary
structures. For instance Stadler et al. [29] as well as Bastolla et al. [3] have shown that
neutral networks exist for proteins, where nucleotide interactions are much more involved [21].
Therefore it is certainly not the uniqueness of Watson-Crick base pairings that implies the
existence of neutral networks. Our particular approach comes from this correlation perspective
and observations from molecular interaction in RNA molecules. First





FIGURE 3. Beyond secondary structures. Suppose we are given an abstract alphabet
{A, B, C, D} with base pairs {{A, B}, {D, C}, {D, B}}. We present diagram representations
of a secondary structure (top), a 3-noncrossing structure (middle) and a 2-diagram
structure (bottom). The difference between the first two structures is the crossing of
bonds and the difference between the last two is the number of interactions for a
nucleotide.

there are secondary and tertiary interactions [4], the latter typically involving secondary
structural elements. Furthermore, interactions within RNA molecules can be categorized into three
classes: helix-helix, loop/bulge-helix and loop-loop interactions [33, 3]. The structures of
phenylalanine tRNA and the hammerhead ribozyme [32] have served as paradigms in this context. Base
triples and tetra-loops, as well as pseudoknots, [33J UHl [1] representing loop- loop interactions have 
led to generalizations of the secondary structure concept. These interactions are subject to
steric constraints arising from the biochemistry of the interactions involved. These observations
give rise to two different combinatorial abstractions: the consideration of k-noncrossing
chemical bonds and of 2-diagrams, i.e. graphs whose vertices are drawn on a horizontal line and
have degree at most two (and the combination of them, k-noncrossing 2-diagrams). The notion of
k-noncrossing arises naturally in the context of pseudoknots, leading to the concepts of
k-noncrossing RNA structures [15] and to Stadler's bi-secondary structures [13] (which are
exactly the planar 3-noncrossing RNA structures). The notion of 2-diagrams comes up when
restricting nucleotide interactions to at most two, thereby allowing the expression of
interactions of secondary structure elements.



2. The Basic Construction 



The notion of 2-diagrams discussed in the introduction is exactly the motivation of our
particular approach. In the following we detail how to derive molecular shapes in which each
nucleotide has at most two interactions but which, in contrast to biophysical structures, have
combinatorial constraints on their nucleotide interactions. This idea is to the best of our
knowledge new. For a given alphabet, base pairing rules specify which nucleotides can pair.
However, not any two nucleotides are able to establish a bond. For instance, they may be
restricted by conditions like: no two edges can cross each other when representing a shape as a
diagram [13]. The non-crossing condition and uniqueness of base pairs are two key properties of
RNA secondary structures and allow for Motzkin-path enumeration and tree bijections
[25, 31, 35, 30, 14]. We replace these restrictions on nucleotide interactions by stipulating
that (a) there exists some base graph H whose sole purpose is to restrict all possible
correlations and (b) we are given a symmetric relation $\mathcal{R}$, tantamount to a base
pairing rule. In order to avoid any confusion we work over the abstract alphabet {A,B,C,D}.

In this framework a shape $\mathbb{S}$ of a sequence is then the unique maximal H-subgraph
subject to the property that for any $\mathbb{S}$-edge the incident nucleotides satisfy
$\mathcal{R}$. It is remarkable that this simple definition already produces a well defined
sequence to structure map! Moreover this definition is in line with the biological point of
view: mapping sequences into shapes rather than fixing some shape and then considering its
sequences. It can now be asked what the right choice of H should
FIGURE 4. Combinatory maps: the base graph $\mathcal{H}$ is displayed on the l.h.s. The
r.h.s. shows two shapes $\mathbb{S}_1$ and $\mathbb{S}_2$ with two particular sequences that
are contained in their respective preimages. For both sequences the shapes are maximal, i.e.
not a single $\mathcal{H}$-edge can be drawn without violating base pairing rules, here
{{A, B}, {D, C}, {D, B}}.

be and how robust the respective conclusions are. As for the dependency on H, the answer is that
it a.s. (almost surely in the sense of random graph theory, i.e. in the limit of long sequences)
depends only on the number of edges. Therefore, the choice of $H = \mathcal{H}$ is not critical
for the validity of the main results. To understand why, we consider a generalization of the
concept of combinatory maps, i.e. combinatory maps induced by the random graph $G_{n,p}$ (the
random graph in which each edge is selected with independent probability p). In the sub-critical
phase these random combinatory maps a.s. produce, modulo constants, all properties of the maps
induced by $\mathcal{H}$ (Theorem 2).

Theorem 23 (Neutral networks) Let p n = ±=£, [3 < \/2 and suppose to n tends to 00 arbitrarily 
slowly and $G n p * s 0. random combinatory map. Then there exist with high probability at least [3 n 
shapes § with the following two properties: 

(I) the set of all sequences mapping into § has a connected component of size at least (V2) 

(II) the set of all sequences mapping into S percolates, i.e. has diameter n — u n . 

The great advantage of choosing $H = \mathcal{H}$ is the simplicity and algorithmic nature of all
proofs. We can explicitly construct all paths involved by diagram chasing. In contrast, the proof
of the above result is based on a nontrivial analysis of tree components in the random graph
$G_{n,p}$. We have the following situation: let H be a graph over {1, 2, ..., n},
$\mathcal{A}$ = {A, B, D, C} and $Q^n_4$ be the generalized n-cube, i.e. the graph over the
sequences $(x_1, \ldots, x_n)$, where $x_i \in \mathcal{A}$, and in which two sequences are
adjacent if they differ in exactly one nucleotide. Let d(v,v') be the number of nucleotides by
which v and v' differ. A component of a graph H is a maximal connected subgraph. We consider
relations $\mathcal{R}$ over the abstract alphabet $\mathcal{A}$ = {A, B, D, C}, i.e.
$\mathcal{R} \subset \mathcal{A} \times \mathcal{A}$ satisfying the following three conditions



These conditions are motivated by abstracting from 2-D and 3-D interactions of the phenylalanine
tRNA and the hammerhead ribozyme [4]. In both molecules mutual interactions of three nucleotides
are absent but multiple pair interactions are responsible for the tertiary structure. In view of
eq. (2.1) and eq. (2.2) each relation can be viewed as a graph over {A, B, D, C} and obviously,
eq. (2.3) is equivalent to this graph being bipartite. We will be particularly interested in the
base pairing rule $\mathcal{R}_{\dagger}$ represented as the graph A - B - D - C, i.e. we allow
for the following interactions: {{A, B}, {D, C}, {D, B}}. In this sense our nucleotide
interactions are more general than those of RNA secondary structures since, for instance, we can
express coaxial stacking of helical regions and the formation of isosteric C • G - G triples [1].
We introduce the H-subgraph $H_{\mathcal{R}}(v)$ having vertex and edge set given by

(2.4) $V_{H_{\mathcal{R}}(v)} = \{1, \ldots, n\}$ and $E_{H_{\mathcal{R}}(v)} = \{\{i,k\} \mid \{i,k\} \text{ is an } H\text{-edge and } (x_i, x_k) \in \mathcal{R}\}$

and call $H_{\mathcal{R}}(v)$ a shape $\mathbb{S}$ and the mapping
$\vartheta_H: Q^n_4 \to \{\mathbb{S} \mid \mathbb{S} = H_{\mathcal{R}}(v)\}$ a combinatory map.
Note that the above construction entails an implicit notion of maximality, i.e. a shape of a
sequence $(x_1, \ldots, x_n)$ is the maximal H-subgraph which satisfies $\mathcal{R}$ for all
2-sets of coordinates $\{x_i, x_j\}$ with {i, j} being an H-edge. In this sense a shape
represents a saturated structure. As for $\mathcal{H}$, suppose first n is even. We set $C_n(1)$
to be the graph over {1, ..., n} with edge set {i, i+1}, where the vertices are labeled modulo n.
Let $\sigma_n$ be some permutation of n letters; we then set $C_n(\sigma_n)$ to be the graph with
edges $\{\sigma_n(i), \sigma_n(i+1)\}$ and $\mathcal{H} = C_n(\sigma_n)$. Next assume n is odd.
Then we select an arbitrary element of {1, ..., n}, say u, and define
$\mathcal{H} = C_{n-1}(\sigma_{n-1}) \cup \{u\}$, i.e. the graph with edges
$\{\sigma_{n-1}(i), \sigma_{n-1}(i+1)\}$ for $i \neq u$ and $i+1 \neq u$, where $\sigma_{n-1}$
is an arbitrary permutation of {1, ..., n} \ {u}. To summarize we have



For instance, it is easy to check that the relation implied by all Watson-Crick base pairs
(i.e. {(A,U),(U,A),(G,C),(C,G)}) and {(G,U),(U,G)} satisfies conditions eq. (2.1), eq. (2.2) and eq. (2.3).



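(2.1) $(x,y) \in \mathcal{R} \iff (y,x) \in \mathcal{R}$
(2.2) $(x,y) \in \mathcal{R} \implies x \neq y$
(2.3) $\forall\, x \neq z:\ (x,y) \in \mathcal{R} \wedge (y,z) \in \mathcal{R} \implies (x,z) \notin \mathcal{R}$.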



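(2.5) $\mathcal{H} = \begin{cases} C_n(\sigma_n) & \text{for } n \text{ even} \\ C_{n-1}(\sigma_{n-1}) \cup \{u\} & \text{for } n \text{ odd.} \end{cases}$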
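
The construction of eq. (2.4) and eq. (2.5) can be illustrated by a short Python sketch. It is
only a sketch under stated assumptions: the base graph is the cycle with the identity
permutation (plus one isolated vertex for odd n), the relation is the one with base pairs
{A,B}, {D,C}, {D,B}, and the names `RELATION`, `cycle_edges` and `shape` are hypothetical
helpers introduced here for illustration.

```python
# Minimal sketch (assumption: base graph = cycle with identity permutation,
# vertices 1..n; for odd n the last vertex n plays the role of the isolated point u).
RELATION = {frozenset("AB"), frozenset("DC"), frozenset("DB")}  # symmetric pairing rule

def cycle_edges(n):
    """Edges of the base graph of eq. (2.5) for the identity permutation."""
    m = n if n % 2 == 0 else n - 1          # odd n: vertex n stays isolated
    return [(i, i % m + 1) for i in range(1, m + 1)]

def shape(seq, edges):
    """Edge set of the shape: all base-graph edges whose end labels may pair, cf. eq. (2.4)."""
    return frozenset((i, k) for i, k in edges
                     if frozenset((seq[i - 1], seq[k - 1])) in RELATION)

edges = cycle_edges(4)
full = shape("CDCD", edges)      # every cycle edge carries a D-C pair
partial = shape("ABBC", edges)   # only the A-B pair on edge (1, 2) survives
```

Since the shape collects every admissible base-graph edge, maximality is built in: no further
edge can be added without violating the pairing rule.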
3. Shapes



In this section we answer the following basic questions: 

(1) What is the relation between base pairing rules and the resulting molecular shapes? 

(2) How many shapes does a combinatory map have? 

(3) Are there "many" shapes with large sets of sequences folding into them? 

All of the above properties are central for RNA secondary structures and none of them can be 
answered analytically, despite the fact that we have generating functions for RNA secondary struc- 
tures. For instance, it is impossible to assess a priori how many secondary structures have an 
actual sequence folding into them. The number of RNA structures that actually occur as mini- 
mum free energy structures can be much smaller than the total number. For n = 16, due to finite 
size effects for the RNA folding, only 63% of the possible RNA structures are realized as minimum 
free energy structures [3]. 



Let us begin by providing some more background: a graph H' is called an induced subgraph of H iff
there exists some set $M \subset \{1, \ldots, n\}$ such that
$E_{H'} = \{\{i,j\} \mid \{i,j\} \in E_H \wedge i,j \in M\}$. Intuitively, induced subgraphs come
from vertex sets and are far more restricted than arbitrary subgraphs. We now give a simple
example of the fact that not every bipartite subgraph of a shape is a shape. For instance,
consider $\vartheta_H: Q^6_4 \to \{H' \le H\}$ where



(3.1) [Graphs $H$ and $H_0$ on the vertices 1, ..., 6; the figure is not reproduced here.]

where the dotted lines represent missing edges. Clearly, H is bipartite and it is easy to check
that indeed H = H(D, C, D, C, D, C) holds. Therefore H is a shape but $H_0$ is not. Every
sequence realizing $H_0$ has necessarily either A at 1 and C at 4, or vice versa. In the first
case D is necessarily at 3 and 5, which leaves no valid choice for 6. The second case follows
analogously.



This is remarkable insofar as making the universal graph H (being responsible for all
interactions) more complex can simply imply that not all of its subgraphs can be folded by
sequences. This is due, as the example indicates, to the nature of the base pairing rule and
shows clearly that both H and $\mathcal{R}$ determine what is a shape and what is not. For simple
base graphs, like for instance $\mathcal{H}$ (eq. (2.5)), the lemma below shows that any subgraph
is a shape. What we can deduce from this is (a) there exist many shapes and (b) $\mathcal{H}$ is
so simple that it is indeed only $\mathcal{R}$ that is relevant for the shapes. The result is

Lemma 1. Suppose H is an arbitrary combinatorial graph over {1, ..., n}.

(a) For any relation $\mathcal{R}$ any shape $\mathbb{S}$ is bipartite.

(b) For the relation $\mathcal{R}_{\dagger}$ and arbitrary base graph H, any induced, bipartite subgraph of H is a shape.

(c) For the relation $\mathcal{R}_{\dagger}$ and the base graph $\mathcal{H}$ any $\mathcal{H}$-subgraph H' is a shape.

Since any $\mathcal{H}$-subgraph is a shape we have, for instance, for sequences of length 16
exactly $2^{16} = 65536$ different shapes, in contrast to only 274 RNA secondary structures
realized by the minimum free energy folding analyzed in [9]. This seems to indicate a vast
difference between combinatory maps and RNA secondary structure folding; however, closer
inspection reveals that in fact most of these structures are very "rare", i.e. only a few have
large preimage sizes. To understand what is happening we present in Figure 5 the data on the
complete mapping from sequences of length 16 into subgraphs of the cycle $\mathcal{H}_{16}$. We
plot the logarithm of the preimage sizes of a combinatory map over the logarithm of the rank.
Figure 5 shows that combinatory maps exhibit 393 shapes with a preimage of size greater than
$0.5 \times 10^6$. As for RNA secondary structures, the data in [9] show that there are 132 RNA
minimum free energy structures with this property. But what happens for larger sequence length?
The asymptotics of RNA secondary structures [14, 16] shows that the number of RNA secondary
structures, $S_2(n)$, satisfies $S_2(n) \sim K n^{-3/2} \alpha^n$ where $1.8488 < \alpha < 2.64$,
depending on what one considers a "realistic" secondary structure. In comparison a combinatory
map produces (Lemma 1) $2^n$ shapes. Therefore combinatory maps produce a total number of
structures which is, for large n, in a comparable size range.
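
For small n the counts discussed above can be checked by exhaustive enumeration. The following
Python sketch is an illustration under stated assumptions, not the procedure used to produce
Figure 5: it takes the cycle with identity permutation as base graph, uses the relation with
base pairs {A,B}, {D,C}, {D,B}, and the helper names `shape` and `preimage_sizes` are
hypothetical. It tabulates the preimage size of every shape and sorts the sizes by rank.

```python
from collections import Counter
from itertools import product

RELATION = {frozenset("AB"), frozenset("DC"), frozenset("DB")}

def shape(seq, n):
    """Shape of seq over the cycle C_n (identity permutation, n even).

    The shape is encoded as the set of indices i of the realized edges {i, i+1 mod n}.
    """
    return frozenset(i for i in range(n)
                     if frozenset((seq[i], seq[(i + 1) % n])) in RELATION)

def preimage_sizes(n):
    """Map each shape to the number of sequences folding into it."""
    return Counter(shape(seq, n) for seq in product("ABCD", repeat=n))

sizes = preimage_sizes(6)
ranked = sorted(sizes.values(), reverse=True)   # preimage sizes ordered by rank
print(len(sizes), ranked[:5], ranked[-5:])
```

Plotting `ranked` on doubly logarithmic axes gives the small-n analogue of the rank plot; the
number of keys of `sizes` is the number of shapes.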

The above observations motivate the question about the number of shapes with large preimages 
[7]. For notational convenience let 



We next prove that there are many shapes with large preimages.

Lemma 2. Suppose the relation $\mathcal{R}_{\dagger}$ and the base graph $\mathcal{H}$ are given, then there exist at least



(3.2) 




and 





sequences folding into them. 
FIGURE 5. A double logarithmic plot (base 10) of the preimage sizes of a combinatory
map for n = 16 as a function of the rank. The underlying graph $\mathcal{H}_{16}$ is displayed
in the lower right. The plot shows that there are a few shapes with large and many shapes
with very small preimages. This observation is in complete analogy with RNA secondary
structure folding maps.

Lemma 2 sets the stage for the further investigation of how this set of sequences is organized.
Now, knowing that there are exponentially large sets of sequences realizing particular shapes,
what can be said about their organization? Are they randomly distributed or clustered in sequence
space? What is their graph structure as induced subgraphs of sequence space?



4. Neutral networks of Combinatory Maps 



One difficulty in the context of neutral networks is that it is practically impossible to prove
they exist. Exhaustive enumeration of sequence spaces is limited to small sequence length n < 20
for four letter alphabets [11] and the results are of limited value since finite size effects
distort the picture. In the case of A, U, G, C-sequences about 60% of all sequences fold into the
open structure [9]. Several attempts have been made to derive somewhat local criteria for whether
neutral networks exist [10], where the key idea is the probing for paths adopted from the actual
random graph proof in [21, 22]. In this context local parameters are the only quantities that
give some clue about the existence and properties of neutral networks. In the case of neutral
networks modeled as random graphs, it is the number of neutral neighbors that controls global
properties like connectivity and density of the corresponding neutral network. A neutral neighbor
is a neighboring sequence which folds into the same structure and the fraction [20]

(4.1) $\lambda^* = 1 - \sqrt[\alpha-1]{\alpha^{-1}}$

is actually the threshold value for connectivity and density. In the following we derive for
combinatory maps the entire distribution of neutral neighbors of particular shapes. The result
is actually not "local" at all and entails detailed information about the entire preimage of
these shapes. To be precise, we can derive the underlying rational generating function using the
transfer matrix method of enumerative combinatorics. We study the quantity
$\lambda_{\mathbb{S}_M}(m)$, being the number of sequences folding into the particular shape
$\mathbb{S}_M$ having exactly m neutral neighbors. Our result reads

Theorem 1. For arbitrary shape $\mathbb{S}_M$, where $M \subset \{1, \ldots, k\}$ denotes its set
of isolated nucleotides, we have

(4.2) $\forall\, m \in \mathbb{N}:\ \lambda_{\mathbb{S}_M}(m) \ge \lambda_{C_{2k}}(m)$

and the generating function of $\lambda_{C_{2k}}(m)$,
$F(x,y) = \sum_{k \ge 2} \sum_{m} \lambda_{C_{2k}}(m)\, x^m y^{2k}$, is given by

(4.3) $F(x,y) = \frac{2\,(-4x^3y^6 + 2x^2y^6 + 3x^2y^4 - 5 + 4x^2y^2 + 8xy^2 - 6x^3y^4 + 2x^4y^6)}{-2x^3y^6 + x^2y^6 + x^2y^4 - 1 + 2xy^2 + x^2y^2 - 2x^3y^4 + x^4y^6}$.



The bivariate function F(x, y) provides detailed information about neutral neighbors of the
entire preimages of shapes $\mathbb{S}_M$. For instance, Taylor expansion of eq. (4.3) yields

$F(x,y) = 10 + (2x^2 + 4x)y^2 + (12x^2 + 2x^4)y^4 + (6x^2 + 16x^3 + 12x^4 + 2x^6)y^6 + O(y^8)$

and the term $(12x^2 + 2x^4)y^4$ shows that for n = 4 there are at least 12 vertices with 2 and
2 vertices with 4 neutral neighbors. Likewise, for n = 6, there are at least 6 with 2, 16 with 3,
12 with 4 and 2 vertices with 6 neutral neighbors. In addition eq. (4.2) guarantees that
$C_{2k}$ itself provides a lower bound on the numbers of neutral neighbors, i.e. we can pinpoint
a specific reference shape providing key information about the neutrality of the entire
combinatory map.
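
The coefficients of $y^4$ and $y^6$ can be cross-checked by brute force for the reference shape,
i.e. the full cycle. The Python sketch below is such a check under stated assumptions (cycle with
identity permutation as base graph, base pairs {A,B}, {D,C}, {D,B}); the function names are
hypothetical and it is not the transfer matrix computation itself.

```python
from collections import Counter
from itertools import product

ALPHABET = "ABCD"
RELATION = {frozenset("AB"), frozenset("DC"), frozenset("DB")}

def is_full_cycle(seq):
    """True if every edge {i, i+1 mod n} of the cycle carries an allowed pair."""
    n = len(seq)
    return all(frozenset((seq[i], seq[(i + 1) % n])) in RELATION for i in range(n))

def neutral_neighbor_distribution(n):
    """Counter {m: number of sequences folding into C_n with exactly m neutral neighbors}."""
    dist = Counter()
    for seq in product(ALPHABET, repeat=n):
        if not is_full_cycle(seq):
            continue
        # A neutral neighbor is a point mutant that still folds into the full cycle.
        m = sum(is_full_cycle(seq[:i] + (c,) + seq[i + 1:])
                for i in range(n) for c in ALPHABET if c != seq[i])
        dist[m] += 1
    return dist

dist4 = neutral_neighbor_distribution(4)
dist6 = neutral_neighbor_distribution(6)
```

Here `dist4` and `dist6` correspond to the polynomials multiplying $y^4$ and $y^6$ in the
expansion: the key m is the exponent of x and the value is its coefficient.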
FIGURE 6. The distribution of neutral neighbors for the entire preimage of the "reference"
shape $\mathbb{S} = \mathcal{H}_{40}$, where n = 40 denotes the sequence length. We plot the
frequency (y-axis) of numbers of neutral neighbors (x-axis) obtained from Theorem 1. Note that
the degree of a vertex in $Q^{40}_4$ is 120, showing that the lower bounds on the fractions of
neutral neighbors range between 13% and 24%.

In the previous section we have shown that there are many shapes with large preimages. However,
it is not obvious what the graph structure of these preimages is. In this section we will study
this structure in detail and prove two remarkable properties. First, there are many shapes with
sets of sequences having diameter n, i.e. there exist two sequences which differ in all
nucleotides, both of which map into the particular shape. This finding is tantamount to
percolation and indicates that the preimages are indeed extended and not confined to some
"local" region of sequence space. Secondly, we prove that the preimages of exponentially many
shapes contain large connected components. In other words we can actually prove the existence of
neutral networks for sequence to shape maps, i.e. many shapes have sets of sequences in which
there exists a component of size
> (V2) and of diameter n. 

Theorem 2. (Neutral networks) Suppose the relation $\mathcal{R}_{\dagger}$ and the base graph
$\mathcal{H}$ are given. Then there exist at least $(\sqrt{2})^{n-1}$ many shapes $\mathbb{S}$
with the properties
FIGURE 7. Neutral network. Sequence space is represented as a lattice and the neutral
net is an induced subgraph (bold edges). We label the pairs of sequences representing
antipodal pairs by (A, B) and (C, D). The two key properties of neutral nets are their
connectivity and percolation.

(I) the set of all sequences mapping into S has a connected component of size at least fi^_ + //". 

(II) the set of all sequences mapping into $\mathbb{S}$ percolates, i.e. has diameter n.



In comparison with the corresponding result for random graphs we observe that the neutral
networks are in fact slightly bigger and the diameter indeed equals n. This results from the fact
that the simpler graph $\mathcal{H}$ allows for a different proof, which is very algorithmic. In
fact the proof indicates how to explicitly obtain these paths of diameter n, while the random
graph analogue can only establish their existence. In this sense both constructions complement
each other. To illustrate the idea of Theorem 2 we consider the cycle $\mathcal{H}_4$ and the
shape $\mathbb{S} = \mathcal{H}_4$. Then we have the following situation (using the notation of
the proof of Theorem 2):

$a = (C, D, C, D)$ and $C_4((C, D, C, D)) = C_4$.

Theorem 2 guarantees the existence of the antipodal sequence $\bar{a} = (B, A, B, A)$ and a path
connecting $a$ and $\bar{a}$ obtained via the steps (a), (b) and (c). Explicitly this path for
$\mathbb{S}$ from $a$ to $\bar{a}$ is given by



[Figure: the neutral path from $a$ to $\bar{a}$ on the cycle $\mathcal{H}_4$; the diagram is not reproduced here.]

step (a): replace C by B

step (b): replace D by A
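
The two steps can be replayed mechanically. The following Python sketch is an illustration under
stated assumptions: the base graph is the 4-cycle with identity permutation, the relation has
base pairs {A,B}, {D,C}, {D,B}, and within each step the positions are mutated from left to
right (the order inside a step is our assumption, and `neutral_path` is a hypothetical helper).

```python
RELATION = {frozenset("AB"), frozenset("DC"), frozenset("DB")}

def is_full_cycle(seq):
    """True if seq folds into the full cycle, i.e. all cyclic neighbors may pair."""
    n = len(seq)
    return all(frozenset((seq[i], seq[(i + 1) % n])) in RELATION for i in range(n))

def neutral_path(start, substitutions):
    """Apply each substitution (old, new) one position at a time, left to right."""
    path, current = [tuple(start)], list(start)
    for old, new in substitutions:
        for i, letter in enumerate(current):
            if letter == old:
                current[i] = new            # a single point mutation
                path.append(tuple(current))
    return path

# step (a): replace C by B, then step (b): replace D by A
path = neutral_path("CDCD", [("C", "B"), ("D", "A")])
for seq in path:
    print("".join(seq), is_full_cycle(seq))
```

Every sequence on the path keeps the full cycle as its shape, consecutive sequences differ in
exactly one position, and the end points differ in all four positions.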



Theorem 2 holds for many shapes. For instance the neutral path for $\mathbb{S}_{\{1\}}$, which
has length $\mathrm{diam}(Q^4_4) = 4$ and which connects the sequence $a^{\{1\}}$ with its
antipodal sequence, is given by



[Figure: the neutral path for $\mathbb{S}_{\{1\}}$; the diagram is not reproduced here.]

step (a): replace C by B

step (b): replace D by A

step (c): replace A by C



5. Appendix 



Proof of Lemma 1. To show (a) we first prove that for any relation satisfying eq. (2.1),
eq. (2.2) and eq. (2.3) a shape $\mathbb{S}$ is bipartite.

Claim. Any closed walk in $\mathbb{S}$ has even length.

Since $\mathbb{S}$ is a shape we have $\mathbb{S} = H_{\mathcal{R}}(v)$, whence for any closed
walk $w = (w_1, w_2, \ldots, w_r, w_1)$ in $\mathbb{S}$ there exists at least one sequence
$x = (x_{w_1}, x_{w_2}, \ldots, x_{w_r}, x_{w_1})$, where $x_h \in \{A,U,G,C\}$. Therefore
there exists an injection

$\{(x_w) \mid w \text{ is a closed walk in } \mathbb{S}\} \to \{\gamma \mid \gamma \text{ is a closed walk in } G(\mathcal{R})\}$

The idea is to show that

$\{\gamma \mid \gamma \text{ is a closed walk in } G(\mathcal{R}) \text{ of odd length}\} = \emptyset$.

Suppose $\gamma$ is a closed walk of minimal, odd length in $G(\mathcal{R})$. Obviously, there
are only 4 vertices in $G(\mathcal{R})$. We can conclude from this that $\gamma$ contains a cycle
of length 3, which is in view of eq. (2.3) impossible, whence the claim.

We next select an arbitrary vertex $i \in \{1, \ldots, n\}$ and color all vertices in even
distance to i blue and all vertices in odd distance red. Suppose this procedure leads to two
monochromatic adjacent vertices j, r. Then we obtain a closed walk containing i, j and r of odd
length. By induction we can conclude that this walk contains a cycle of odd length, which is
impossible, whence $\mathbb{S}$ is bipartite and assertion (a) follows.

Next we show (b) by constructing a vertex $v = (x_1, \ldots, x_n) \in Q^n_4$ with the property
$H_{\mathcal{R}_{\dagger}}(v) = H'$, where H' is an arbitrary induced, bipartite subgraph of H.
Since H' is induced in H there exists some set $M \subset \{1, \ldots, n\}$ such that
$E_{H'} = \{\{i,j\} \mid \{i,j\} \in E_H \wedge i,j \in M\}$. First, for all coordinates $x_j$
where $j \notin M$ we set $x_j = A$. Then by definition of $\mathcal{R}_{\dagger}$, for
$i, i' \notin M$, $(x_i, x_{i'}) \notin \mathcal{R}_{\dagger}$ holds. Since H' is bipartite there
exists for the vertices $j \in M$ a bi-coloring (red/blue) such that no two H'-adjacent vertices
are monochromatic. Suppose $x_j, x_k$ are coordinates where $j, k \in M$. We choose a bi-coloring
(red/blue) and set $x_j = D$ for j being colored red and $x_k = C$ for k being colored blue,
respectively. In view of $(D, C), (C, D) \in \mathcal{R}_{\dagger}$, we can conclude that for
$j, k \in M$ and $\{j, k\} \in H$ we have $(x_j, x_k) \in \mathcal{R}_{\dagger}$. Since
$(A, C), (A, D) \notin \mathcal{R}_{\dagger}$ we derive that for $i \notin M$ and $j \in M$,
$(x_i, x_j) \notin \mathcal{R}_{\dagger}$ holds. Therefore
$H_{\mathcal{R}_{\dagger}}((x_1, \ldots, x_n)) = H'$, i.e. any induced bipartite subgraph of H is
a shape.

Next we show (c), i.e. for $\mathcal{H}$ (eq. (2.5)) any $H' \le \mathcal{H}$ is a shape. We
proceed by explicitly constructing a vertex $v = (x_1, \ldots, x_n) \in Q^n_4$ with the property
$\mathcal{H}_{\mathcal{R}_{\dagger}}(v) = H'$. W.l.o.g. we can assume that n is even since the
isolated point u does not contribute to the $\mathcal{H}$-shapes. Then we have
$\mathcal{H} = C_{2k}$ and $V_{C_{2k}} = \{1, \ldots, 2k\}$. We label the H'-vertices
{1, ..., 2k} clockwise such that the (clockwise) first vertex in one largest H'-component is 1.
Then H' corresponds to a unique sequence of components. We assume now $x_i \in \{A, B\}$ and
label all H'-vertices except those contained in the component preceding vertex 1. We set
inductively



where $\bar{B} = A$ and $\bar{A} = B$. As for the labeling of the component preceding the
component containing vertex 1, we start with $x_j = C$ and continue inductively $x_{j+1} = D$,
$x_{j+2} = C$, .... This procedure results in a labeling compatible with H' since for
$\{i-1, i\} \in H'$ we have either {C, D} or {A, B} and for $\{i-1, i\} \notin H'$ we have
{A, A}, {B, B} and {A, C} or {B, C} (at the beginning of the last component) and {D, A} or
{C, A} (at the end of the last component). Accordingly we obtain a sequence $v_{H'}$ with the
property $\mathcal{H}_{\mathcal{R}_{\dagger}}(v_{H'}) = H'$. □

Proof of Lemma 2. By definition, there exists a unique component of $\mathcal{H}$ which is a
cycle of even length, $C_{2k}$. It contains for n even all and for n odd all but one of the
$\mathcal{H}$-vertices. Suppose $C_{2k}$ contains the vertices

(5.1) $\{i_1, j_1, \ldots, i_k, j_k\}$, where $i_1 < j_1 < i_2 < \ldots < i_k < j_k$.

Claim. The number of 2k-tuples $(x_{i_1}, x_{j_1}, \ldots, x_{i_k}, x_{j_k})$ such that
$C_{2k}((x_{i_1}, x_{j_1}, \ldots, x_{i_k}, x_{j_k})) = C_{2k}$, i.e.
$(x_{i_1}, x_{j_1}, \ldots, x_{i_k}, x_{j_k}) \in \vartheta^{-1}_{C_{2k}}(C_{2k})$, is given by

(5.2) 2 + fi 2 _ k ) . 

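To prove the claim we observe that $\mathcal{R}_{\dagger}$ induces the digraph
$D_{\mathcal{R}_{\dagger}}$ over the vertices A, B, D, C with arcs in both directions between A
and B, between B and D, and between D and C. Its adjacency matrix (rows and columns ordered
A, B, D, C) is

$A_{D_{\mathcal{R}_{\dagger}}} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$.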


The number of 2k-tuples $(x_{i_1}, x_{j_1}, \ldots, x_{i_k}, x_{j_k})$ with the property
$C_{2k}((x_{i_1}, x_{j_1}, \ldots, x_{i_k}, x_{j_k})) = C_{2k}$ is equal to the number of closed
walks of length 2k in $D_{\mathcal{R}_{\dagger}}$. Indeed, in order to obtain such a 2k-tuple we
fix an index, $i_1$, say. Then we start successively with A, B, D and C and form the closed
walks of length 2k in $D_{\mathcal{R}_{\dagger}}$ starting and ending at A, B, D and C. All
these walks are counted respectively, since we have labeled graphs. The number of closed walks
of length $\ell$ in $D_{\mathcal{R}_{\dagger}}$ starting and ending at i is given by
$(A^{\ell}_{D_{\mathcal{R}_{\dagger}}})_{i,i}$, whence the number of all closed walks of length
$\ell$ is simply
$\mathrm{Tr}(A^{\ell}_{D_{\mathcal{R}_{\dagger}}}) = \sum_i (A^{\ell}_{D_{\mathcal{R}_{\dagger}}})_{i,i}$.
From the definition of the characteristic polynomial we have
$\mathrm{Tr}(A^{\ell}_{D_{\mathcal{R}_{\dagger}}}) = \omega_1^{\ell} + \cdots + \omega_r^{\ell}$,
where $\omega_1, \ldots, \omega_r$ are the eigenvalues of $A_{D_{\mathcal{R}_{\dagger}}}$ (note
r = 4). We obtain



E 

l>0 



[(i + (-i)')(m5- + a*1)]^ 



and the claim follows. 
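The closed-walk count can be checked by brute force for small $k$. The sketch below assumes that the relation is the path $A - B - D - C$ (as read off the adjacency matrix) and that $\mu_\pm = (\sqrt{5} \pm 1)/2$, in which case $\mu_+^{2k} + \mu_-^{2k}$ is the Lucas number $L_{2k}$. The function names are illustrative and not taken from the paper.

```python
from itertools import product

LETTERS = "ABDC"
# Assumption: the relation is the path A - B - D - C.
EDGES = {("A", "B"), ("B", "A"), ("B", "D"), ("D", "B"), ("D", "C"), ("C", "D")}


def mat_mul(p, q):
    """Multiply two square integer matrices given as nested lists."""
    n = len(p)
    return [[sum(p[i][t] * q[t][j] for t in range(n)) for j in range(n)] for i in range(n)]


def trace_of_power(length):
    """Tr(A^length) for the adjacency matrix A of the assumed relation."""
    adj = [[1 if (a, b) in EDGES else 0 for b in LETTERS] for a in LETTERS]
    power = [[int(i == j) for j in range(4)] for i in range(4)]
    for _ in range(length):
        power = mat_mul(power, adj)
    return sum(power[i][i] for i in range(4))


def count_full_cycle_tuples(k):
    """Count 2k-tuples whose cyclically consecutive letters are all related."""
    n = 2 * k
    return sum(
        all((x[p], x[(p + 1) % n]) in EDGES for p in range(n))
        for x in product(LETTERS, repeat=n)
    )


def lucas(m):
    """Lucas number L_m with L_0 = 2 and L_1 = 1."""
    a, b = 2, 1
    for _ in range(m):
        a, b = b, a + b
    return a


for k in range(2, 5):
    print(k, count_full_cycle_tuples(k), trace_of_power(2 * k), 2 * lucas(2 * k))
```

The three printed columns agree for each $k$ if the brute-force count, the trace and the closed formula coincide under the stated assumptions.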

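Suppose $(x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}) \in \vartheta_{C_{2k}}^{-1}(C_{2k})$ and $M \subset \{1, \dots, k\}$. We consider the involution
$r\colon \mathcal{A} \to \mathcal{A}$, where $r(A) = B$ and $r(D) = C$, and set

(5.3) $I_M(x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}) = (y_{i_1}, x_{j_1}, \dots, y_{i_k}, x_{j_k})$, where $y_{i_h} = \begin{cases} r(x_{i_h}) & \text{for } i_h \in M \\ x_{i_h} & \text{for } i_h \notin M. \end{cases}$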



Claim. There exists a bijection

$\beta\colon \{M \subset \{1, 2, \dots, k\}\} \to \{S_M\}, \quad M \mapsto S_M$

where $S_M$ is obtained by deleting the two $C_{2k}$-edges incident to each of the vertices $i_h \in M$, and

(5.4) $\forall\, (x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}) \in \vartheta_{C_{2k}}^{-1}(C_{2k}); \quad S_M = C_{2k}(I_M(x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}))$.

Suppose $M \neq M'$, then w.l.o.g. we can assume that there exists some index $i_h \in M \setminus M'$,
i.e. $i_h$ is isolated in $S_M$ but not in $S_{M'}$. Since $j_{h-1}$ and $j_h$ are both in $S_M$ and $S_{M'}$ we have
$\{j_{h-1}, i_h\}, \{j_h, i_h\} \in S_{M'}$ but not in $S_M$, whence $S_M$ and $S_{M'}$ are different shapes. Since $S_M$ is an
induced bipartite subgraph, Lemma Q] implies that any $S_M$ is a shape. When $i_h \in M$ the following
diagram

$x_{j_{h-1}} - x_{i_h} - x_{j_h} \quad \longmapsto \quad x_{j_{h-1}} \;\; r(x_{i_h}) \;\; x_{j_h}$

shows that $I_M$ has the property: for arbitrary

$(x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}) \in \vartheta_{C_{2k}}^{-1}(C_{2k})$

the shape $C_{2k}(I_M(x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}))$ differs from $C_{2k}$ exactly by deleting the two $C_{2k}$-edges
incident to all $i_h \in M$; explicitly



0- 



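and the claim is proved. The claim implies that $I_M$ induces the injection

(5.5) $I_M\colon \vartheta_{C_{2k}}^{-1}(C_{2k}) \to \vartheta_{C_{2k}}^{-1}(S_M), \quad (x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}) \mapsto I_M(x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k})$.

This injection allows us to relate the sets $\vartheta_{C_{2k}}^{-1}(C_{2k})$ and $\vartheta_{C_{2k}}^{-1}(S_M)$ and in particular

(5.6) $|\vartheta_{C_{2k}}^{-1}(C_{2k})| \le |\vartheta_{C_{2k}}^{-1}(S_M)|$.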
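The map $I_M$ and the correspondence $M \mapsto S_M$ can be checked exhaustively for small $k$. The following is a minimal sketch under two assumptions: the relation is the path $A - B - D - C$, and the positions $i_1, j_1, \dots, i_k, j_k$ are encoded as the indices $0, 1, \dots, 2k-1$ of a tuple (even indices for $i_h$, odd indices for $j_h$). All helper names are hypothetical.

```python
from itertools import combinations, product

LETTERS = "ABDC"
RELATED = {frozenset("AB"), frozenset("BD"), frozenset("DC")}  # assumed relation
R = {"A": "B", "B": "A", "D": "C", "C": "D"}  # the involution r


def shape(x):
    """Cycle edges (p, p+1 mod n) whose two letters are related."""
    n = len(x)
    return frozenset(
        (p, (p + 1) % n) for p in range(n)
        if frozenset((x[p], x[(p + 1) % n])) in RELATED
    )


def apply_I(x, M):
    """Apply r at the positions i_h (even index 2h) with h in M."""
    return tuple(R[c] if (p % 2 == 0 and p // 2 in M) else c for p, c in enumerate(x))


def expected_shape(k, M):
    """C_2k with the two edges at every i_h, h in M, removed."""
    n = 2 * k
    removed = set()
    for h in M:
        i = 2 * h
        removed.add(((i - 1) % n, i))
        removed.add((i, (i + 1) % n))
    return frozenset((p, (p + 1) % n) for p in range(n)) - removed


def check(k):
    n = 2 * k
    full = frozenset((p, (p + 1) % n) for p in range(n))
    preimage = [x for x in product(LETTERS, repeat=n) if shape(x) == full]
    shapes = set()
    for size in range(k + 1):
        for M in map(set, combinations(range(k), size)):
            images = [apply_I(x, M) for x in preimage]
            assert len(set(images)) == len(preimage)  # I_M is injective
            assert all(shape(y) == expected_shape(k, M) for y in images)
            shapes.add(expected_shape(k, M))
    return len(preimage), len(shapes)


print(check(2), check(3))
```

`check(k)` returns the size of the preimage of $C_{2k}$ together with the number of distinct shapes $S_M$, so the second entry can be compared with $2^k$.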

Since $M \subset \{1, \dots, k\}$ was arbitrary we can conclude that there are $2^k$ subsets and hence $2^k$ distinct
shapes $S_M$. Hence there exist at least

$2^k \ge (\sqrt{2})^{n-1}$

shapes $S$ with the property

$|\vartheta_{\mathcal{H}}^{-1}(S)| \ge |\vartheta_{\mathcal{H}}^{-1}(\mathcal{H})| \ge 2(\mu_+^{2k} + \mu_-^{2k})$.

In case of $n \not\equiv 0 \mod 2$ we have exactly one more isolated point, i.e.

(5.7) $|\vartheta_{\mathcal{H}}^{-1}(S)| \ge 8(\mu_+^{n-1} + \mu_-^{n-1})$

and since $4 > \mu_+ + \mu_-$ the lemma follows. □

Proof of Theorem [2]. We first prove that at least $(\sqrt{2})^n$ shapes $S$ have a preimage with
diameter $n$. We will work with the particular set of shapes $\{S_M \mid M \subset \{1, \dots, k\}\}$, introduced
in Lemma [5], and prove that all of them have a component of size $\ge \mu_+^n + \mu_-^n > (\sqrt{2})^n$ and
$\mathrm{diam}(\vartheta_{\mathcal{H}}^{-1}(S)) = n$. Let $C_{2k}$ be the $\mathcal{H}$-cycle, which contains all $\mathcal{H}$-vertices for $n$ even and all but
one of the $\mathcal{H}$-vertices for $n$ odd. Let $V_{C_{2k}} = \{i_1, j_1, \dots, i_k, j_k\}$, where $i_1 < j_1 < i_2 < \dots < i_k < j_k$.
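Claim 1. Let $M \subset \{1, \dots, k\}$, then there exist at least $2^k$ shapes $S_M$ over $Q_4^{2k}$ such that

(5.8) $\mathrm{diam}(\vartheta_{C_{2k}}^{-1}(S_M)) = \begin{cases} n & \text{for } n \equiv 0 \mod 2 \\ n - 1 & \text{for } n \not\equiv 0 \mod 2. \end{cases}$

We first show that for each $M$ there exists a pair of antipodal sequences, i.e. $(a^M, \tilde{a}^M)$ with
$d(a^M, \tilde{a}^M) = 2k$, and a path $(a^M, w_1^M, \dots, w_{2k-1}^M, \tilde{a}^M)$ such that $\vartheta_{C_{2k}}(w_i^M) = S_M$.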



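(5.9) $a^M = (a_{i_1}^M, a_{j_1}, \dots, a_{i_k}^M, a_{j_k})$, where $a_{j_h} = D$, and $a_{i_h}^M = \begin{cases} A & \text{for } i_h \in M \\ C & \text{otherwise.} \end{cases}$

In particular we have $a^{\varnothing} = (C, D, \dots, C, D)$. Then $S_M = C_{2k}(a^M)$, i.e. $S_M$ is the shape obtained
by removing for each $i_h \in M$ the two incident $C_{2k}$-edges. Next we define an antipode $\tilde{a}^M$, i.e. an
element of $Q_4^{2k}$ with the property $d(a^M, \tilde{a}^M) = 2k$, as follows

(5.10) $\tilde{a}^M = (\tilde{a}_{i_1}^M, \tilde{a}_{j_1}, \dots, \tilde{a}_{i_k}^M, \tilde{a}_{j_k})$, where $\tilde{a}_{j_h} = A$, and $\tilde{a}_{i_h}^M = \begin{cases} C & \text{for } i_h \in M \\ B & \text{otherwise.} \end{cases}$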



We can transform $a^M$ into $\tilde{a}^M$ by successively changing exactly one coordinate in three steps:
(a) replace (in any order) for $i_h \notin M$ successively all $a_{i_h} = C$ by $B$, (b) replace (in any order)
successively all $a_{j_h} = D$ by $A$ and finally (c) substitute (in any order) for all $i_h \in M$ $a_{i_h} = A$ by
$C$.

This proves that there exists a $Q_4^{2k}$-path

(5.11) $(a^M, w_1^M, \dots, w_{2k-1}^M, \tilde{a}^M)$

connecting $a^M$ and $\tilde{a}^M$, such that

(5.12) $\forall\, 1 \le i \le 2k-1, \quad C_{2k}(w_i^M) = S_M$.

I.e. all intermediate steps of the path are mapped by $\vartheta_{C_{2k}}$ into the shape $S_M$. As shown in Lemma [2]
there are $2^k$ different shapes $S_M$ induced by the subsets $M \subset \{1, \dots, k\}$, whence Claim 1.
In case of $n \equiv 0 \mod 2$ we derive $2^k = (\sqrt{2})^n$. In case of $n \not\equiv 0 \mod 2$ there exists exactly one
vertex $u$ which is isolated in $\mathcal{H}$. Then we simply add the isolated point $u$ to each shape $S_M$ and
shall in the following identify these new shapes with $S_M$. Then $|\vartheta_{\mathcal{H}}^{-1}(S_M)| = 4\,|\vartheta_{C_{2k}}^{-1}(S_M)|$. We can
choose $a_u = A$ and $\tilde{a}_u = B$, and then

$a_u^M = (a_{i_1}^M, a_{j_1}, \dots, a_u, \dots, a_{i_k}^M, a_{j_k})$

$\tilde{a}_u^M = (\tilde{a}_{i_1}^M, \tilde{a}_{j_1}, \dots, \tilde{a}_u, \dots, \tilde{a}_{i_k}^M, \tilde{a}_{j_k})$

satisfy $d(a_u^M, \tilde{a}_u^M) = n$ and there exists a $Q_4^n$-path $(a_u^M, \dots, w_{2k}^M, \tilde{a}_u^M)$ connecting $a_u^M$ and $\tilde{a}_u^M$,
with the property

(5.13) $\forall\, 1 \le i \le 2k, \quad C_{2k}(w_i^M) = S_M$.

Therefore we have proved that at least $(\sqrt{2})^n$ shapes $S_M$ have a preimage $\vartheta_{\mathcal{H}}^{-1}(S_M)$ with diameter
$n$.
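The three-step transformation of $a^M$ into its antipode can be traced mechanically for the even case. This is a sketch only: it assumes the relation is the path $A - B - D - C$ and encodes $i_h$ as even and $j_h$ as odd tuple indices, and the function names are not from the paper.

```python
from itertools import combinations

RELATED = {frozenset("AB"), frozenset("BD"), frozenset("DC")}  # assumed relation


def shape(x):
    """Cycle edges (p, p+1 mod n) whose two letters are related."""
    n = len(x)
    return frozenset(
        (p, (p + 1) % n) for p in range(n)
        if frozenset((x[p], x[(p + 1) % n])) in RELATED
    )


def endpoints(k, M):
    """a^M and its antipode; even positions are i_h, odd positions are j_h."""
    a = [("A" if p // 2 in M else "C") if p % 2 == 0 else "D" for p in range(2 * k)]
    b = [("C" if p // 2 in M else "B") if p % 2 == 0 else "A" for p in range(2 * k)]
    return a, b


def neutral_path(k, M):
    """Change one coordinate at a time following steps (a), (b), (c)."""
    a, target = endpoints(k, M)
    order = [p for p in range(0, 2 * k, 2) if p // 2 not in M]  # (a): C -> B
    order += list(range(1, 2 * k, 2))                           # (b): D -> A
    order += [p for p in range(0, 2 * k, 2) if p // 2 in M]     # (c): A -> C
    steps = [tuple(a)]
    for p in order:
        a[p] = target[p]
        steps.append(tuple(a))
    return steps


def check(k):
    for size in range(k + 1):
        for M in map(set, combinations(range(k), size)):
            steps = neutral_path(k, M)
            assert len(steps) == 2 * k + 1                 # 2k single-letter moves
            assert steps[-1] == tuple(endpoints(k, M)[1])  # ends at the antipode
            assert len({shape(s) for s in steps}) == 1     # shape never changes
            assert all(sum(u != v for u, v in zip(s, t)) == 1
                       for s, t in zip(steps, steps[1:]))
    return True


print(check(2), check(3))
```

Within each step the order of the replacements is fixed here for simplicity; the text allows any order.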

Claim 2.

(5.14) $|\{S_M \mid |c(\vartheta_{C_{2k}}^{-1}(S_M))| \ge \mu_+^{2k} + \mu_-^{2k}\}| \ge 2^k$.

To prove Claim 2 we first observe that $\vartheta_{C_{2k}}^{-1}(\mathcal{H})$ has exactly two components of equal size

(5.15) $\mu_+^{2k} + \mu_-^{2k}$.

Indeed, any vertex $v \in \vartheta_{C_{2k}}^{-1}(\mathcal{H})$ can be transformed into either

$a = (C, D, C, \dots, D, C)$, or $b = (D, C, \dots, D, C, D)$

successively using the two steps (I) replace (in any order) all $A$ by $D$ and (II) replace (in any
order) all $B$ by $C$. Hence there exist exactly two components and the map

$\sigma(x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}) = (x_{j_k}, x_{i_1}, \dots, x_{j_{k-1}}, x_{i_k})$

is a bijection between them, whence they have equal size. Eq. (5.15) then follows from eq. (5.2) in
Lemma [2]. We next claim that the mapping $I_M$ of eq. (5.3) is in fact an injective graph morphism

(5.16) $I_M\colon \vartheta_{C_{2k}}^{-1}(C_{2k}) \to \vartheta_{C_{2k}}^{-1}(S_M), \quad (x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}) \mapsto I_M(x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k})$.

I.e. for two adjacent vertices $v, v' \in \vartheta_{C_{2k}}^{-1}(C_{2k})$ the vertices $I_M(v)$ and $I_M(v')$ are adjacent. To prove
this we consider the diagrams



The above diagrams represent the two scenarios for two adjacent vertices $v, v' \in \vartheta_{C_{2k}}^{-1}(C_{2k})$.
I.e. if $v$ and $v'$ are both contained in $\vartheta_{C_{2k}}^{-1}(C_{2k})$ and differ in $x_{i_h}$ and $x'_{i_h}$ then we have either
$x_{j_{h-1}} = x_{j_h} = B$ and $x_{i_h} = D$ and $x'_{i_h} = A$, or $x_{j_{h-1}} = x_{j_h} = D$ and $x_{i_h} = B$ and $x'_{i_h} = C$.
Suppose we apply $I_M$ and $i_h \in M$, then the resulting vertices $I_M(v)$ and $I_M(v')$ are again adjacent,
whence $I_M$ is an injective graph morphism. Accordingly, $I_M$ maps components into components,
from which we can conclude that for each $M \subset \{1, \dots, k\}$ the shape $S_M$ has a component of size
$\ge \mu_+^{2k} + \mu_-^{2k}$ and Claim 2 is proved.
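The statement that the preimage of $C_{2k}$ splits into exactly two components of equal size can be checked by a breadth-first search for small $k$. The sketch assumes the relation is the path $A - B - D - C$ and treats two sequences as adjacent when they differ in exactly one position; `component_sizes` is an illustrative name.

```python
from collections import deque
from itertools import product

LETTERS = "ABDC"
RELATED = {frozenset("AB"), frozenset("BD"), frozenset("DC")}  # assumed relation


def in_preimage(x):
    """True if all cyclically consecutive letters of x are related."""
    n = len(x)
    return all(frozenset((x[p], x[(p + 1) % n])) in RELATED for p in range(n))


def component_sizes(k):
    """Sizes of the connected components of the preimage of C_2k."""
    n = 2 * k
    vertices = {x for x in product(LETTERS, repeat=n) if in_preimage(x)}
    seen, sizes = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            # Visit all Hamming-distance-1 neighbors that stay in the preimage.
            for p in range(n):
                for c in LETTERS:
                    w = v[:p] + (c,) + v[p + 1:]
                    if c != v[p] and w in vertices and w not in seen:
                        seen.add(w)
                        queue.append(w)
        sizes.append(size)
    return sorted(sizes)


print(component_sizes(2), component_sizes(3))
```

Under the additional assumption $\mu_\pm = (\sqrt{5} \pm 1)/2$, each component size should equal the Lucas number $L_{2k}$.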

In case of $2k = n$ the assertion follows directly. For $n$ odd we have to repeat the argument in
Lemma [2] where we considered the isolated point $u$ in eq. (5.7). Since we used the same set of
shapes $\{S_M \mid M \subset \{1, \dots, k\}\}$ for both claims the theorem follows. □



Proof of Theorem [1]. It is clear that we can restrict our analysis to the case $n \equiv 0 \mod 2$,
i.e. $\mathcal{H} = C_{2k}$, since the isolated point always contributes 4 neutral neighbors for any shape. Eq. (4.2)
is a direct consequence of

$I_M\colon \vartheta_{C_{2k}}^{-1}(C_{2k}) \to \vartheta_{C_{2k}}^{-1}(S_M), \quad (x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}) \mapsto I_M(x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k})$

being an injective graph morphism. Thus it suffices to prove eq. (4.3). We observe that the mapping

$v = (x_{i_1}, x_{j_1}, \dots, x_{i_k}, x_{j_k}) \mapsto (t_{i_1}, t_{j_1}, \dots, t_{i_k}, t_{j_k})$, where $t_s = \begin{cases} (x_{j_{h-1}}, x_{i_h}, x_{j_h}) & \text{for } s = i_h \\ (x_{i_h}, x_{j_h}, x_{i_{h+1}}) & \text{for } s = j_h \end{cases}$

is a bijection, where $h$ is considered modulo $k$. Hence every $v \in \vartheta_{C_{2k}}^{-1}(C_{2k})$ can be uniquely
decomposed into a sequence of triples. Since $v \in \vartheta_{C_{2k}}^{-1}(C_{2k})$ there are exactly the following ten
triples

$V_D = \{ABA, ABD, BAB, BDB, BDC, DBD, DBA, DCD, CDC, CDB\}$



and setting

$E_D = \{((x_{j_{h-1}}, x_{i_h}, x_{j_h}), (x_{i_h}, x_{j_h}, x_{i_{h+1}})) \mid (x_{j_{h-1}}, x_{i_h}, x_{j_h}) \in V_D\}$

we obtain the digraph $D$. Suppose we are given $v, v' \in \vartheta_{C_{2k}}^{-1}(C_{2k})$ with $d(v, v') = 1$, then we have
the following alternative




[Diagram: the two alternatives for the triple $(x_{j_{h-1}}, x_{i_h}, x_{j_h})$, with outer entries $B$ or $D$]



The idea is now to count all triples, i.e. $(x_{j_{h-1}}, x_{i_h}, x_{j_h})$, $(x_{i_{h-1}}, x_{j_{h-1}}, x_{i_h})$, contained in $\Theta =
\{BAB, BDB, DBD, DCD\}$ in $\vartheta_{C_{2k}}^{-1}(C_{2k})$. Let next $R[x]$ be a polynomial ring and $w\colon E_D \to R[x]$
a function given by $w(e) = x$ iff the arc $e$ has terminus $t \in \Theta$, otherwise $w(e) = 1$. If $T = e_1 e_2 \cdots e_\ell$
is a walk of length $\ell$ in $E_D$, then the weight of $T$ is defined by $w(T) = w(e_1) w(e_2) \cdots w(e_\ell)$. Introducing
the formal variable $x$ in $w$ allows us to count the triples in $\Theta$ within some $v \in \vartheta_{C_{2k}}^{-1}(C_{2k})$.
The number of closed walks of length $\ell$ in $D$ is $\sum_{v \in V_D} (A_D^{\ell})_{v,v} = \mathrm{Tr}(A_D^{\ell})$, where $A_D$ is the adjacency
matrix of $D$.

Suppose $B$ is a $p \times p$ matrix and $\{\eta_i\}_{i=1}^{p}$ are all the eigenvalues of $B$, then we have $\det B = \prod_i \eta_i$.
Let $\{\xi_i\}_{i=1}^{p}$ and $\{\omega_i\}_{i=1}^{p}$ be all the eigenvalues of $I - yA$ and $A$ respectively, then we have
$\xi_i = 1 - y\omega_i$, where $1 \le i \le p$. For the set of all the nonzero eigenvalues of $A$, $\{\omega_i\}_{i=1}^{r}$, we
derive $\det(I - yA) = \prod_{i=1}^{r} (1 - y\omega_i)$. We set $Q(y) = \det(I - yA)$ and have $p = 10 = |V_D|$, $A = A_D$
and $r = 6$ for $x \neq 1$, whence

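(5.17) $\sum_{\ell \ge 1} \mathrm{Tr}(A^{\ell})\, y^{\ell} = \sum_{\ell \ge 1} (\omega_1^{\ell} + \dots + \omega_r^{\ell})\, y^{\ell} = \sum_{i=1}^{r} \frac{\omega_i y}{1 - \omega_i y} = -\frac{y\, Q'(y)}{Q(y)}$.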

After some computation we derive $Q(y) = 1 - 2xy^2 - x^2y^2 + 2x^3y^4 - x^4y^6 + 2x^3y^6 - x^2y^6 - x^2y^4$
and the lemma follows from eq. (5.17). □
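The polynomial $Q(y)$ can be cross-checked numerically with exact rational arithmetic. The sketch below builds the weighted adjacency matrix of the digraph $D$ on the ten triples, assuming that an arc goes from $(a, b, c)$ to $(b, c, d)$ and carries weight $x$ exactly when its terminus lies in $\{BAB, BDB, DBD, DCD\}$, and compares $\det(I - yA)$ with the stated polynomial at a few sample points. The helper names are illustrative only.

```python
from fractions import Fraction

TRIPLES = ["ABA", "ABD", "BAB", "BDB", "BDC", "DBD", "DBA", "DCD", "CDC", "CDB"]
THETA = {"BAB", "BDB", "DBD", "DCD"}  # triples that carry the weight x


def weighted_matrix(x):
    """Weighted adjacency matrix of D: arc s -> t if s[1:] == t[:2]."""
    index = {t: i for i, t in enumerate(TRIPLES)}
    a = [[Fraction(0)] * len(TRIPLES) for _ in TRIPLES]
    for s in TRIPLES:
        for t in TRIPLES:
            if s[1:] == t[:2]:
                a[index[s]][index[t]] = x if t in THETA else Fraction(1)
    return a


def det(m):
    """Determinant by Gaussian elimination over the rationals."""
    m = [row[:] for row in m]
    n, result = len(m), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def q_det(x, y):
    """det(I - yA) for the weighted matrix A."""
    a = weighted_matrix(x)
    n = len(a)
    return det([[Fraction(int(i == j)) - y * a[i][j] for j in range(n)] for i in range(n)])


def q_formula(x, y):
    """The polynomial Q(y) as stated in the text."""
    return (1 - 2 * x * y**2 - x**2 * y**2 + 2 * x**3 * y**4 - x**2 * y**4
            - x**4 * y**6 + 2 * x**3 * y**6 - x**2 * y**6)


samples = [(Fraction(2), Fraction(1, 2)), (Fraction(1, 3), Fraction(3))]
print([q_det(x, y) == q_formula(x, y) for x, y in samples])
```

Agreement at sample points is a consistency check rather than a proof; a symbolic determinant would be needed to confirm the identity in general.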
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